MIT License
Copyright (c) 2020 Abhishek Singh Sambyal, Ashish Kaushal, Poojith U. Rao
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Given a task with sufficient labeled data, supervised learning methods have shown very promising results and proven to be a competent approach to a wide range of problems [1]. Efforts have been made to scale up these methods, but the true cost of labeled data is often significantly high. Considering the enormity of unlabeled data compared to human-annotated data, a natural way to handle this problem is unsupervised learning, which aims to perform the task using unlabeled data alone.
Many decades have passed since the origin of unsupervised learning, but it has not yet realized its hidden potential. After all, to recognize something, we must first know what to look for. For example, to identify an object in an image, we must know what the object looks like, what its properties are, and how it differs from other similar objects.
In computer vision, models performing tasks like semantic segmentation, object detection, or object recognition have made significant progress and now compete with humans on complex visual benchmarks. Unfortunately, the achievement of these models is largely constrained by the amount of labeled data, and they are customized to specific scenarios. For example, a model trained on the ImageNet dataset to perceive road traffic in daylight might not work impeccably at night [2, 3].
As a result, a significant amount of research focuses on models that adapt to new environmental conditions without the usual large and expensive supervision. This includes advances in transfer learning, self-supervised learning, semi-supervised learning, domain adaptation, and weakly supervised learning.
Luckily, in the text domain, an automatic supervisory signal we can use for learning representations is context [4, 5, 6, 7]. Given a large corpus of text, we can train a model that maps each word to a feature vector such that it predicts the words likely to occur before or after that word. This converts an unsupervised learning problem into a self-supervised one. Context prediction here is the "pretext task": it forces the model to learn word embeddings in a substantial way, improving accuracy and helping us solve many real-world tasks [6]. With image data, we cannot use context as a pretext task in such a plain and straightforward manner. It turns out that many pretext tasks are available for image data, and we can choose among them: colorization, image patch prediction, classification of corrupted images, inpainting, ordering of video frames, and image rotation are a few. Each of these is a state-of-the-art technique providing significant results depending on the dataset to which it is applied.
In our work, we provide a self-supervised formulation that uses patch context prediction as the pretext task. From the 8 patches surrounding the center patch of an image, we sample a random pair (the center patch and one neighbour) and try to predict the position of one patch with respect to the other. We present a ConvNet-based architecture for this pair classification and train our model on a mini-ImageNet dataset.
Self-supervised learning is a framework in which a pretext task is specified for learning true representations/features in such a way that they later help in solving real-world downstream tasks. Since it is a generic framework, it can be used in a wide variety of applications such as robotics and computer vision.
In robotics, multiple perception modalities and the results of interacting with the world are signals that can be used to create self-supervised tasks [8, 9, 10]. Similarly, when learning representations from videos, one can use temporal consistency [11] or the cross-modal synchronization of audio, video, and subtitles [12, 13, 14].
However, when thinking of an ideal image representation in terms of the latent variables of a generative model, we want the model to generate images from their natural distribution, capture the common causes shared across different images, and share information between them. But inference over latent structure given an image is computationally hard. To deal with this, many methods such as contrastive divergence [15], deep Boltzmann machines [16], the wake-sleep algorithm [17], and variational Bayesian methods [18, 19] have been proposed. Generative models perform well on smaller datasets like handwritten digits [15, 16, 18, 19] but do not work well with high-resolution natural images.
In unsupervised representation learning, one way to learn image embeddings in a true sense is to create a supervised pretext task whose labels are part of the input data itself. The model then learns embeddings from the data alone, which are useful for other real-world tasks. For example, denoising autoencoders [20, 21] use reconstruction from noisy data as the pretext task: to reconstruct the clean input, the model must learn to separate signal from noise. Sparse autoencoders also use reconstruction as a pretext task, together with a sparsity penalty. Such autoencoders are stacked to form deep representations [22, 23].
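The denoising pretext task described above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the cited architectures: the layer sizes and noise level are our own assumptions. The key point is that the labels are simply the clean inputs.

```python
import torch
import torch.nn as nn

# Toy denoising autoencoder: the pretext labels are the clean inputs themselves.
class DenoisingAE(nn.Module):
    def __init__(self, dim=784, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

x_clean = torch.rand(16, 784)                        # a batch of flattened images
x_noisy = x_clean + 0.1 * torch.randn_like(x_clean)  # corrupt the input
model = DenoisingAE()
# Reconstruct the clean signal from the corrupted input.
loss = nn.functional.mse_loss(model(x_noisy), x_clean)
loss.backward()
```

The encoder's hidden activations are the learned representation; downstream tasks would discard the decoder and reuse the encoder.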
Another pretext task is "context prediction". This task has been widely used in Natural Language Processing, where it is applied to text. Skip-gram models [6] train a deep network to predict, from a single word, the n preceding and n succeeding words, which yields useful word representations. The same motivation can be applied in the image domain, but a problem persists: we cannot determine whether the predictions are correct [24] unless we only care about predicting low-level features [25, 26, 27]. To tackle this issue, [28] designed an approach that predicts the appearance of an image region from the consensus vote of the transitive nearest neighbours surrounding it. The common issue all these approaches face is that predicting pixels is much harder than predicting text, unless one adopts an unorthodox approach that side-steps direct pixel prediction.
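As a concrete illustration of the context-prediction signal exploited by skip-gram models, the following hypothetical helper (our own sketch, not from [6]) turns raw text into (center word, context word) training pairs with a window of n words on each side:

```python
# Turn a token sequence into (center, context) pairs: the self-supervised
# signal used by skip-gram style models.
def skipgram_pairs(tokens, n=2):
    pairs = []
    for i, center in enumerate(tokens):
        # Context window: up to n tokens before and after the center word.
        for j in range(max(0, i - n), min(len(tokens), i + n + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), n=1)
# e.g. ('cat', 'the') and ('cat', 'sat') are among the generated pairs
```

No human labels are involved: the "targets" are other words from the same corpus, which is exactly what makes context prediction self-supervised.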
from google.colab import drive
drive.mount('/gdrive')
!ls /gdrive/My\ Drive/training_models
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import models, datasets, transforms
import torchvision
import torch.nn.functional as F
import torchvision.transforms.functional as TF
import albumentations
from PIL import Image
import numpy as np
import os
import matplotlib.pyplot as plt
import random
import time
import nibabel as nib
from tqdm import tqdm
import pandas as pd
import skimage
from skimage import img_as_ubyte, img_as_float32
from sklearn.model_selection import StratifiedShuffleSplit
from glob import glob
np.random.seed(108)
plt.style.use('default')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
visualize = False
class Config():
ROOT = 'tiny-imagenet-200'
TRAIN_PATH = 'tiny-imagenet-200/train'
VAL_PATH = 'tiny-imagenet-200/val'
TEST_PATH = 'tiny-imagenet-200/test'
subset_data = 1000
patch_dim = 15
gap = 3
batch_size = 64
num_epochs = 65
lr = 0.0005
def imshow(img,text=None,should_save=False):
plt.figure(figsize=(10, 10))
npimg = img.numpy()
plt.axis("off")
if text:
plt.text(75, 8, text, style='italic',fontweight='bold',
bbox={'facecolor':'white', 'alpha':0.8, 'pad':10})
plt.imshow(np.transpose(npimg, (1, 2, 0)))
plt.show()
def show_plot(iteration,loss,fname):
plt.plot(iteration,loss)
plt.savefig(fname)
plt.show()
"""
Args:
tensor (Tensor): Tensor image of size (C, H, W) to be normalized.
Returns:
Tensor: Normalized image.
"""
class UnNormalize(object):
def __init__(self, mean, std):
self.mean = mean
self.std = std
def __call__(self, tensor):
for t, m, s in zip(tensor, self.mean, self.std):
t.mul_(s).add_(m)
return tensor
unorm = UnNormalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
# unorm(tensor)
def convert_format(data, format):
if format == 'p':
return np.transpose(data, (0, 3, 1, 2))
if format == 'n':
return np.transpose(data, (0, 2, 3, 1))
if format == '3':
return np.transpose(data, (1, 2, 0))
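A quick round-trip check of the layout helper above (redefined here so the snippet stands alone): 'p' converts a NumPy-style NHWC batch to PyTorch's NCHW, 'n' converts back, and '3' converts a single CHW image to HWC for plotting.

```python
import numpy as np

def convert_format(data, format):
    # 'p': NHWC -> NCHW (PyTorch), 'n': NCHW -> NHWC (NumPy/matplotlib),
    # '3': single image CHW -> HWC
    if format == 'p':
        return np.transpose(data, (0, 3, 1, 2))
    if format == 'n':
        return np.transpose(data, (0, 2, 3, 1))
    if format == '3':
        return np.transpose(data, (1, 2, 0))

batch = np.zeros((4, 64, 64, 3))  # NHWC batch of four 64x64 RGB images
assert convert_format(batch, 'p').shape == (4, 3, 64, 64)
assert convert_format(convert_format(batch, 'p'), 'n').shape == batch.shape
```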
In our experimentation, we used the Tiny ImageNet dataset. It contains 100,000 training images, 10,000 validation images, and 10,000 test images drawn from 200 object classes. Originally the experiments were to be performed on the training set of the ImageNet dataset (about 1.3M images), but due to limited computational power we performed the task on Tiny ImageNet instead. To further reduce computation, we stratified the dataset, chose 10 of the 200 available classes, and selected 4,000 images as our final training set such that the proportion per class remains the same. The images in this dataset are downscaled from the original ImageNet size of 256x256 to 64x64.
!wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
!unzip -q tiny-imagenet-200.zip
For computational efficiency, instead of working on all possible patches of each image, we sample patches in a grid-like structure such that each patch can pair with at most 8 other patches. We sample patches at a resolution of 15x15. To ensure the model does not take the easy way out using object boundaries and continuing lines, we leave a gap of 3 pixels between consecutive patches on all facing sides. Preprocessing of the patches includes mean subtraction, randomly downsampling the pixels of some patches and then upsampling them, and projecting and dropping colors. After that, we upsample the patches to 96x96 for further use.
#############################
# Creating training dataset
#############################
df_list = []
classes = os.listdir(Config.TRAIN_PATH)
for idx, each_class in enumerate(classes):
images_in_each_class = glob(f'{Config.TRAIN_PATH}/{each_class}/**/*.JPEG')
df_list += [[each_image, each_class] for each_image in images_in_each_class]
df = pd.DataFrame(data=df_list, columns=['filename', 'class'])
# Taking the classes subset
num_training_classes_subset = 10
train_classes_used = df['class'].unique()[:num_training_classes_subset]
df = df[df['class'].isin(train_classes_used)]
# df.groupby('class').count()
X, y = df['filename'], df['class']
ratio = Config.subset_data/len(X)
sss = StratifiedShuffleSplit(n_splits=5, train_size=ratio, random_state=0)
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
print("TRAIN:", len(train_index), "TEST:", len(test_index))
stratified1000trn = train_index
break
df_trn = df.iloc[stratified1000trn].reset_index(drop=True)
df_trn.head()
#############################
# Creating validation dataset
#############################
df = pd.read_csv('tiny-imagenet-200/val/val_annotations.txt',
                 header=None,
                 names=['filename', 'class', '_1', '_2', '_3', '_4'],
                 delim_whitespace=True)
df.drop(['_1', '_2', '_3', '_4'], axis=1, inplace=True)
# Using only those classes in the dataset which are used in the training
df = df[df['class'].isin(train_classes_used)]
X, y = df['filename'], df['class']
sss = StratifiedShuffleSplit(n_splits=5, train_size=0.2, random_state=0)
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
print("TRAIN:", len(train_index), "TEST:", len(test_index))
stratified1000val = train_index
break
df_val = df.iloc[stratified1000val].reset_index(drop=True)
df_val['filename'] = 'tiny-imagenet-200/val/images/' + df_val['filename']
df_val[['filename', 'class']].head()
#########################################
# This class generates patches for training
#########################################
class MyDataset(Dataset):
    def __init__(self, patch_dim, gap, df, validate, transform=None):
        self.patch_dim, self.gap = patch_dim, gap
        self.transform = transform
        # Training and validation use the same logic; only the dataframe passed in differs.
        self.train_data = df.values
def get_patch_from_grid(self, image, patch_dim, gap):
image = np.array(image)
offset_x, offset_y = image.shape[0] - (patch_dim*3 + gap*2), image.shape[1] - (patch_dim*3 + gap*2)
start_grid_x, start_grid_y = np.random.randint(0, offset_x), np.random.randint(0, offset_y)
patch_loc_arr = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3)]
loc = np.random.randint(len(patch_loc_arr))
tempx, tempy = patch_loc_arr[loc]
patch_x_pt = start_grid_x + patch_dim * (tempx-1) + gap * (tempx-1)
patch_y_pt = start_grid_y + patch_dim * (tempy-1) + gap * (tempy-1)
random_patch = image[patch_x_pt:patch_x_pt+patch_dim, patch_y_pt:patch_y_pt+patch_dim]
patch_x_pt = start_grid_x + patch_dim * (2-1) + gap * (2-1)
patch_y_pt = start_grid_y + patch_dim * (2-1) + gap * (2-1)
uniform_patch = image[patch_x_pt:patch_x_pt+patch_dim, patch_y_pt:patch_y_pt+patch_dim]
random_patch_label = loc
return uniform_patch, random_patch, random_patch_label
def __len__(self):
return len(self.train_data)
def __getitem__(self, index):
image = Image.open(self.train_data[index]).convert('RGB')
uniform_patch, random_patch, random_patch_label = self.get_patch_from_grid(image,
self.patch_dim,
self.gap)
if uniform_patch.shape[0] != 96:
uniform_patch = skimage.transform.resize(uniform_patch, (96, 96))
random_patch = skimage.transform.resize(random_patch, (96, 96))
uniform_patch = img_as_float32(uniform_patch)
random_patch = img_as_float32(random_patch)
# Replace the green and blue channels (indices 1 and 2) with Gaussian noise (std ~1/100 of the std of the remaining red channel)
uniform_patch[:, :, 1] = np.random.normal(0.485, 0.01 * np.std(uniform_patch[:, :, 0]), (uniform_patch.shape[0],uniform_patch.shape[1]))
uniform_patch[:, :, 2] = np.random.normal(0.485, 0.01 * np.std(uniform_patch[:, :, 0]), (uniform_patch.shape[0],uniform_patch.shape[1]))
random_patch[:, :, 1] = np.random.normal(0.485, 0.01 * np.std(random_patch[:, :, 0]), (random_patch.shape[0],random_patch.shape[1]))
random_patch[:, :, 2] = np.random.normal(0.485, 0.01 * np.std(random_patch[:, :, 0]), (random_patch.shape[0],random_patch.shape[1]))
random_patch_label = np.array(random_patch_label).astype(np.int64)
if self.transform:
uniform_patch = self.transform(uniform_patch)
random_patch = self.transform(random_patch)
return uniform_patch, random_patch, random_patch_label
##################################################
# Creating Train/Validation dataset and dataloader
##################################################
traindataset = MyDataset(Config.patch_dim, Config.gap, df_trn['filename'], False,
transforms.Compose([transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])]))
trainloader = torch.utils.data.DataLoader(traindataset,
batch_size=Config.batch_size,
shuffle=True,
# num_workers=Config.num_workers
)
valdataset = MyDataset(Config.patch_dim, Config.gap, df_val['filename'], True,
transforms.Compose([transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])]))
valloader = torch.utils.data.DataLoader(valdataset,
batch_size=Config.batch_size,
shuffle=False)
##############################
# Visualizing training dataset
##############################
example_batch = next(iter(trainloader))
concatenated = torch.cat((unorm(example_batch[0]),unorm(example_batch[1])),0)
imshow(torchvision.utils.make_grid(concatenated))
print(f'Labels: {example_batch[2].numpy()}')
##############################
# Visualizing validation dataset
##############################
example_batch_val = next(iter(valloader))
concatenated = torch.cat((unorm(example_batch_val[0]),unorm(example_batch_val[1])),0)
imshow(torchvision.utils.make_grid(concatenated))
print(f'Labels: {example_batch_val[2].numpy()}')
For our pretext task of predicting the relative position of patches, we aim to learn a suitable image representation. We use Convolutional Neural Networks (ConvNets), which are well known for learning complex image representations with only essential human design. Out of the 9 patches generated from an image, we choose 2 such that the first is always the middle patch and the second can occupy any of the other 8 spatial configurations. The ConvNet predicting the relative position of the second patch with respect to the first takes the two selected patches as input, processes them through multiple convolution layers, and produces a softmax output assigning a probability to each of the eight possible positions. We ultimately want individual patches to learn feature embeddings such that visually similar patches across different images remain close in the embedding space. For this, we use an AlexNet-style late-fusion architecture that processes each patch separately until the depth analogous to fc6 in AlexNet; after this layer the two streams are fused and processed together. For the layers processing a single patch, weights are tied between the two sides of the network, so both patches are mapped by the same fc6-level embedding function. Since only the final layers receive input from both patches, the capacity for joint reasoning is limited, and the network is expected to perform the bulk of the semantic reasoning on each patch individually.
class AlexNetwork(nn.Module):
def __init__(self,aux_logits = False):
super(AlexNetwork, self).__init__()
self.cnn = nn.Sequential(
nn.Conv2d(3, 96, kernel_size=11, stride=4),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.LocalResponseNorm(96),
nn.Conv2d(96, 384, kernel_size=5, stride = 2,padding = 2),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.LocalResponseNorm(384),
nn.Conv2d(384, 384, kernel_size=3, stride=1,padding = 1),
nn.ReLU(inplace=True),
nn.BatchNorm2d(384),
nn.Conv2d(384, 384, kernel_size=3, stride=1,padding = 1),
nn.ReLU(inplace=True),
nn.BatchNorm2d(384),
nn.Conv2d(384, 256, kernel_size=3, stride=1,padding = 1),
nn.ReLU(inplace=True),
nn.BatchNorm2d(256),
nn.MaxPool2d(kernel_size=3, stride=2,padding = 1),
)
self.fc6 = nn.Sequential(
nn.Linear(256,4096),
nn.ReLU(inplace=True),
nn.BatchNorm1d(4096),
)
self.fc = nn.Sequential(
nn.Linear(2*4096,4096),
nn.ReLU(inplace=True),
nn.Linear(4096, 4096),
nn.ReLU(inplace=True),
nn.Linear(4096, 8)
)
def forward_once(self, x):
output= self.cnn(x)
output = output.view(output.size()[0], -1)
output = self.fc6(output)
return output
def forward(self, uniform_patch, random_patch):
output_fc6_uniform = self.forward_once(uniform_patch)
output_fc6_random = self.forward_once(random_patch)
output = torch.cat((output_fc6_uniform,output_fc6_random), 1)
output = self.fc(output)
return output, output_fc6_uniform, output_fc6_random
model = AlexNetwork().to(device)
#############################################
# Initialized Optimizer, criterion, scheduler
#############################################
optimizer = optim.Adam(model.parameters(), lr=Config.lr)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
mode='min',
patience=5,
factor=0.3, verbose=True)
############################
# Training/Validation Engine
############################
global_trn_loss = []
global_val_loss = []
# previous_val_loss = 100
for epoch in range(Config.num_epochs):
train_running_loss = []
val_running_loss = []
start_time = time.time()
model.train()
for idx, data in tqdm(enumerate(trainloader), total=int(len(traindataset)/Config.batch_size)):
uniform_patch, random_patch, random_patch_label = data[0].to(device), data[1].to(device), data[2].to(device)
optimizer.zero_grad()
output, output_fc6_uniform, output_fc6_random = model(uniform_patch, random_patch)
loss = criterion(output, random_patch_label)
loss.backward()
optimizer.step()
train_running_loss.append(loss.item())
else:  # for-else: validation runs once the epoch's training loop completes without break
correct = 0
total = 0
model.eval()
with torch.no_grad():
for idx, data in tqdm(enumerate(valloader), total=int(len(valdataset)/Config.batch_size)):
uniform_patch, random_patch, random_patch_label = data[0].to(device), data[1].to(device), data[2].to(device)
output, output_fc6_uniform, output_fc6_random = model(uniform_patch, random_patch)
loss = criterion(output, random_patch_label)
val_running_loss.append(loss.item())
_, predicted = torch.max(output.data, 1)
total += random_patch_label.size(0)
correct += (predicted == random_patch_label).sum()
print('Val Progress --- total:{}, correct:{}'.format(total, correct.item()))
print('Val Accuracy of the network on the validation images: {}%'.format(100 * correct / total))
global_trn_loss.append(sum(train_running_loss) / len(train_running_loss))
global_val_loss.append(sum(val_running_loss) / len(val_running_loss))
scheduler.step(global_val_loss[-1])
print('Epoch [{}/{}], TRNLoss:{:.4f}, VALLoss:{:.4f}, Time:{:.2f}'.format(
epoch + 1, Config.num_epochs, global_trn_loss[-1], global_val_loss[-1],
(time.time() - start_time) / 60))
if epoch % 20 == 0:
MODEL_SAVE_PATH = f'/gdrive/My Drive/model_{Config.batch_size}_{Config.num_epochs}_{Config.lr}_{Config.subset_data}_{Config.patch_dim}_{Config.gap}.pt'
torch.save(
{
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
'global_trnloss': global_trn_loss,
'global_valloss': global_val_loss
}, MODEL_SAVE_PATH)
checkpoint = torch.load('/gdrive/My Drive/training_models/model_colab_300.pt', map_location='cuda')
plt.plot(range(len(checkpoint['global_trnloss'])), checkpoint['global_trnloss'], label='TRN Loss')
plt.plot(range(len(checkpoint['global_valloss'])), checkpoint['global_valloss'], label='VAL Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Main Network Training/Validation Loss plot')
plt.legend()
plt.show()
To demonstrate how well the network learned to associate semantically similar patches across different images, we use nearest-neighbour matching. Intuition suggests that training should yield similar representations for semantically similar patches; our goal is therefore to see which patches the network considers similar. We randomly sample 96x96 patches and represent them using fc6 features (discarding fc7 and higher layers), using only one of the two stacks of the network rather than both. Nearest neighbours are then found by comparing these feature vectors (Euclidean distance in our implementation). To gauge performance, we compared our model's fc6 features against the fc7 features of an AlexNet trained on ImageNet, and against our architecture with random weight initialization. The matches returned by our features capture the semantic information we are after, and in some cases resemble what the ImageNet-trained AlexNet captures. Interestingly, the randomly initialized ConvNet also does a reasonably good job.
!ls /gdrive/My\ Drive
# checkpoint = torch.load('/gdrive/My Drive/model_lr0.0005_Adam_epochs300', map_location='cuda')
# model.load_state_dict(checkpoint['model_state_dict'])
# checkpoint = torch.load('/gdrive/My Drive/training_models/model_64_300_0.0001_1000_15_3.pt', map_location='cuda')
# model.load_state_dict(checkpoint['model_state_dict'])
checkpoint = torch.load('/gdrive/My Drive/training_models/model_colab_300.pt', map_location='cuda')
model.load_state_dict(checkpoint['model_state_dict'])
!ls /gdrive/My\ Drive/training_models
checkpoint = torch.load('/gdrive/My Drive/training_models/model_64_300_0.0001_1000_15_3.pt', map_location='cuda')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
data_iter_1 = iter(valloader)
example_batch = next(data_iter_1)
vectors = []
with torch.no_grad():  # feature extraction only; no gradients needed
    for j, data in enumerate(trainloader, 0):
        img0, img1, label = data[0].to(device), data[1].to(device), data[2].to(device)
        output, output1, output2 = model(img0, img1)
        img1 = img1.cpu().numpy()
        output2 = output2.cpu().numpy()
        for i in range(len(output2)):
            vectors.append([img1[i], output2[i]])
#model_colab_300.pt
img0, img1, label = example_batch
label = label.reshape([-1])
# Bug fix: embed the sampled example batch itself, not the last batch left over from the loop above.
img0, img1, label = img0.to(device), img1.to(device), label.to(device)
with torch.no_grad():
    output, output1, output2 = model(img0, img1)
output2 = output2.cpu().numpy()
img1 = img1.cpu().numpy()
for i in range(20):
vectors.sort(key=lambda tup: np.linalg.norm(tup[1]-output2[i]))
npimg = img1[i]
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(1,10,1)
plt.axis("off")
ax1.imshow(np.transpose(unorm(torch.tensor(npimg)), (1, 2, 0)))
for j in range(1,10):
ax1 = fig.add_subplot(1,10,j+1)
ax1.imshow(np.transpose(unorm(torch.tensor(vectors[j-1][0])), (1, 2, 0)))
plt.axis("off")
plt.show()
In the nearest-neighbours experiment, we retrieve the top k images most similar to an input image. To do so, we first compute and store the feature-vector representations of all images. For a given input image, we then sort the stored vectors by Euclidean distance to its vector and plot the top k images.
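The retrieval step described above can be sketched as follows; the helper name `top_k_neighbours` and the toy 8-dimensional vectors are illustrative, not from our pipeline:

```python
import numpy as np

# Rank stored (image, vector) pairs by Euclidean distance to a query vector
# and return the k closest images.
def top_k_neighbours(query_vec, vectors, k=5):
    dists = [np.linalg.norm(vec - query_vec) for _, vec in vectors]
    order = np.argsort(dists)[:k]
    return [vectors[i][0] for i in order]

rng = np.random.default_rng(0)
bank = [(f'img_{i}', rng.normal(size=8)) for i in range(100)]
query = bank[7][1] + 0.01 * rng.normal(size=8)  # near-duplicate of img_7's vector
neighbours = top_k_neighbours(query, bank, k=3)  # img_7 should rank first
```

In the notebook, `vectors` plays the role of `bank` and the fc6 embedding of a validation patch plays the role of `query`.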
As we can see, the images retrieved by this method are indeed similar, and they follow a trend: the colors of the retrieved images are almost the same, and their textures are also similar.
For example, the first row contains leaves, and most of its similar images also contain leaves; images of sky retrieve images with portions of sky; white images retrieve other images of the same color.
During early nearest-neighbour experimentation, we noticed that some patches retrieved patches from almost the same absolute location in the image (bottom-right), regardless of visual similarity or context. This was due to chromatic aberration in the patches. Chromatic aberration arises because a lens focuses light of different wavelengths at slightly different positions, causing one color channel to shift relative to the others as distance from the image center grows. It turns out that a ConvNet can learn to localize a patch with respect to the lens simply by detecting this color separation between green and magenta. Once the network learns the absolute location within the image, it can use this trivial shortcut to predict the relative patch position from color alone, instead of learning the true features of the patch. To demonstrate this phenomenon, we trained a network to predict the absolute (x, y) coordinates of a patch sampled from Tiny ImageNet. The overall accuracy turned out to be reasonably high, and it was particularly high for a few top images.
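To see why this shortcut exists, chromatic aberration can be crudely simulated by displacing one color channel. Real aberration is radial and grows away from the image center, so this uniform shift is only an illustrative assumption, not the author's preprocessing:

```python
import numpy as np

# Simulate a crude green-magenta separation: shift the green channel a few
# pixels relative to red/blue. A network can learn to detect this offset and
# use it to localize a patch, rather than learning semantic features.
def add_aberration(img, shift=2):
    out = img.copy()
    out[:, :, 1] = np.roll(img[:, :, 1], shift, axis=1)  # displace green channel
    return out

img = np.random.rand(64, 64, 3).astype(np.float32)
shifted = add_aberration(img, shift=2)  # red/blue untouched, green displaced
```

Dropping or randomly projecting the color channels, as done in the patch preprocessing above, removes exactly this kind of cue.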
!ls /gdrive/My\ Drive/training_models
checkpoint = torch.load('/gdrive/My Drive/model_lr0.0005_Adam_epochs300', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
def display_canvas(patch_bucket, coordinates, title):
    # Lay out the 9 patches on one canvas at their sampled (x, y) coordinates.
    fig, ax = plt.subplots()
    idx = 0
    for i in reversed(range(3)):
        for j in range(0, 9, 3):
            ax.imshow(patch_bucket[idx], extent=[coordinates[i+j][0],
                                                 coordinates[i+j][0]+Config.patch_dim,
                                                 coordinates[i+j][1],
                                                 coordinates[i+j][1]+Config.patch_dim], aspect='auto')
            idx += 1
    plt.title(title)
    plt.show()
# display_canvas(patch_bucket, coordinates)
#############################
# Creating training dataset
#############################
Config.subset_data = 950
df_list = []
classes = os.listdir(Config.TRAIN_PATH)
for idx, each_class in enumerate(classes):
images_in_each_class = glob(f'{Config.TRAIN_PATH}/{each_class}/**/*.JPEG')
df_list += [[each_image, each_class] for each_image in images_in_each_class]
df = pd.DataFrame(data=df_list, columns=['filename', 'class'])
# Taking the classes subset
num_training_classes_subset = 2
train_classes_used = df['class'].unique()[:num_training_classes_subset]
df = df[df['class'].isin(train_classes_used)]
X, y = df['filename'], df['class']
ratio = Config.subset_data/len(X)
sss = StratifiedShuffleSplit(n_splits=5, train_size=ratio, random_state=0)
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
print("TRAIN:", len(train_index), "TEST:", len(test_index))
stratified1000trn = train_index
break
df_trn = df.iloc[stratified1000trn].reset_index(drop=True)
df_trn.head()
#############################
# Creating validation dataset
#############################
df = pd.read_csv('tiny-imagenet-200/val/val_annotations.txt',
                 header=None,
                 names=['filename', 'class', '_1', '_2', '_3', '_4'],
                 delim_whitespace=True)
df.drop(['_1', '_2', '_3', '_4'], axis=1, inplace=True)
# Using only those classes in the dataset which are used in the training
df = df[df['class'].isin(train_classes_used)]
X, y = df['filename'], df['class']
sss = StratifiedShuffleSplit(n_splits=5, train_size=0.2, random_state=0)
sss.get_n_splits(X, y)
print(sss)
for train_index, test_index in sss.split(X, y):
print("TRAIN:", len(train_index), "TEST:", len(test_index))
stratified1000val = train_index
break
df_val = df.iloc[stratified1000val].reset_index(drop=True)
df_val['filename'] = 'tiny-imagenet-200/val/images/' + df_val['filename']
df_val[['filename', 'class']].head()
####################################
# Chromatic Aberration Dataset Class
####################################
class ChromaticAberrationDataset(Dataset):
    def __init__(self, patch_dim, gap, df, validate, transform=None):
        self.patch_dim, self.gap = patch_dim, gap
        self.transform = transform
        # Training and validation use the same logic; only the dataframe passed in differs.
        self.train_data = df.values
def get_patches_and_coordinates(self, image, patch_dim, gap):
patch_loc_arr = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)]
patch_coordinates = []
        # Fixed grid origin: the 3x3 grid of 15px patches with 3px gaps spans 51px, which fits a 64px image at offset 9.
        start_grid_x, start_grid_y = 9, 9
        patch_bucket = np.empty([9, 3, 96, 96], dtype='float32')
        for i, (tempx, tempy) in enumerate(patch_loc_arr):
patch_x_pt = start_grid_x + patch_dim * (tempx-1) + gap * (tempx-1)
patch_y_pt = start_grid_y + patch_dim * (tempy-1) + gap * (tempy-1)
patch_coordinates.append([patch_x_pt, patch_y_pt])
img_patch = image[patch_x_pt:patch_x_pt+patch_dim, patch_y_pt:patch_y_pt+patch_dim]
# Resizing the patch to 96x96
if img_patch.shape[0] != 96:
img_patch = skimage.transform.resize(img_patch, (96, 96))
img_patch = img_as_float32(img_patch)
patch_bucket[i] = np.transpose(img_patch, (2, 0, 1))
return patch_bucket, np.array(patch_coordinates)
def __len__(self):
return len(self.train_data)
def __getitem__(self, index):
image = np.array(Image.open(self.train_data[index]).convert('RGB'))
patch_bucket, coordinates = self.get_patches_and_coordinates(image, self.patch_dim, self.gap)
coordinates = coordinates.astype(np.float32)
return patch_bucket, coordinates
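For Tiny-ImageNet's 64x64 images, the 3x3 grid in `get_patches_and_coordinates` fits exactly inside the frame. A minimal sketch, assuming the values used in this notebook (patch_dim=15, gap=3, fixed start offset 9):

```python
# Recompute the 3x3 patch grid corners used by get_patches_and_coordinates,
# assuming patch_dim=15, gap=3 and the fixed start offset of 9.
patch_dim, gap, start = 15, 3, 9

coords = []
for gx in (1, 2, 3):
    for gy in (1, 2, 3):
        # Top-left corner of patch (gx, gy); grid cells are patch_dim + gap apart
        x = start + (patch_dim + gap) * (gx - 1)
        y = start + (patch_dim + gap) * (gy - 1)
        coords.append((x, y))

print(coords[:3])                 # [(9, 9), (9, 27), (9, 45)] -> first grid row
print(coords[-1][0] + patch_dim)  # 60 -> the last patch ends inside a 64x64 image
```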
traindataset = ChromaticAberrationDataset(Config.patch_dim, Config.gap, df_trn['filename'], False,
transforms.Compose([transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])])
)
trainloader = torch.utils.data.DataLoader(traindataset,
batch_size=Config.batch_size,
shuffle=True,
)
valdataset = ChromaticAberrationDataset(Config.patch_dim, Config.gap, df_val['filename'], True,
transforms.Compose([transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])])
)
valloader = torch.utils.data.DataLoader(valdataset,
batch_size=Config.batch_size,
shuffle=False)
class ColorAbberationNetwork(nn.Module):
def __init__(self):
super(ColorAbberationNetwork, self).__init__()
self.cnn = nn.Sequential(
nn.Conv2d(3, 96, kernel_size=11, stride=4),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.LocalResponseNorm(96),
nn.Conv2d(96, 384, kernel_size=5, stride=2, padding=2),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.LocalResponseNorm(384),
nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.BatchNorm2d(384),
nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.BatchNorm2d(384),
nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.BatchNorm2d(256),
nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
self.fc6 = nn.Sequential(
nn.Linear(256,4096),
nn.ReLU(inplace=True),
nn.BatchNorm1d(4096),
)
self.fc = nn.Sequential(
nn.Linear(4096, 2)
)
def forward(self, patch_bunch):
bs, _, _, _ = patch_bunch.shape
output = self.cnn(patch_bunch)
output = output.view(bs, -1)
output = self.fc6(output)
output = self.fc(output)
return output
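The `nn.Linear(256, 4096)` in `fc6` only works because a 96x96 patch shrinks to a 1x1 spatial map by the end of the conv stack. A quick sanity check, tracing the spatial size with the standard conv/pool output-size formula:

```python
# Trace the spatial size of a 96x96 patch through the conv stack above
# to see why the flattened feature is 256-dimensional.
def out_size(n, kernel, stride, pad=0):
    # Standard conv/pool output-size formula (floor division)
    return (n + 2 * pad - kernel) // stride + 1

n = 96
n = out_size(n, 11, 4)        # Conv2d(3, 96, 11, stride=4)       -> 22
n = out_size(n, 3, 2)         # MaxPool2d(3, stride=2)            -> 10
n = out_size(n, 5, 2, pad=2)  # Conv2d(96, 384, 5, s=2, p=2)      -> 5
n = out_size(n, 3, 2)         # MaxPool2d(3, stride=2)            -> 2
n = out_size(n, 3, 1, pad=1)  # three 3x3 convs, s=1, p=1         -> 2 (unchanged)
n = out_size(n, 3, 1, pad=1)
n = out_size(n, 3, 1, pad=1)
n = out_size(n, 3, 2, pad=1)  # MaxPool2d(3, stride=2, padding=1) -> 1
print(256 * n * n)            # 256 -> matches nn.Linear(256, 4096)
```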
modelCAN = ColorAbberationNetwork().to(device)
#######################################
# Loading the weights of the main model
#######################################
m = model.state_dict()
mc = modelCAN.state_dict()
for idx, k in enumerate(m.keys()):
print(idx, k)
if idx < 32:
mc[k].copy_(m[k])
print(f'{k} layer weights saved')
else:
print(f'{k} layer weights NOT saved')
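The loop above transfers the first 32 state_dict entries (the shared conv/fc6 stack) from the pretrained model into the new network and leaves the fresh regression head untouched. A minimal sketch of the same pattern, with plain dicts standing in for PyTorch state_dicts (keys and the cutoff K are illustrative; the notebook uses 32):

```python
# Copy the first K entries of a source state dict into a destination
# dict with the same key order; later (new head) entries stay as-is.
src = {'cnn.0.weight': [1, 2], 'cnn.0.bias': [3], 'fc.0.weight': [9]}
dst = {'cnn.0.weight': [0, 0], 'cnn.0.bias': [0], 'fc.0.weight': [0]}

K = 2  # number of leading entries to transfer
for idx, key in enumerate(src):
    if idx < K:
        dst[key] = src[key]  # transferred layer

print(dst)  # fc.0.weight keeps its fresh initialization
```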
# Freezing pretrained layers except the last fc layer
# Run this if you are loading the pretrained weights of the main model
for param in modelCAN.cnn.parameters():
param.requires_grad = False
for param in modelCAN.fc6.parameters():
param.requires_grad = False
optimizer = optim.SGD(modelCAN.fc.parameters(), lr=0.01, momentum=0.9)
criterion = nn.MSELoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
mode='min',
patience=5,
factor=0.3, verbose=True)
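The training loop below wraps `nn.MSELoss` in `torch.sqrt`, i.e. it optimizes the root-mean-square error between predicted and ground-truth patch coordinates. The same quantity in NumPy, on toy coordinate pairs:

```python
import numpy as np

# torch.sqrt(nn.MSELoss()(pred, target)) is the RMSE over all
# coordinate entries; here is the equivalent NumPy computation.
pred = np.array([[10.0, 12.0], [30.0, 28.0]])  # predicted (x, y) per patch
target = np.array([[9.0, 9.0], [27.0, 27.0]])  # ground-truth grid corners

rmse = np.sqrt(np.mean((pred - target) ** 2))
print(round(float(rmse), 4))  # 2.2361 (= sqrt(5))
```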
from tqdm.notebook import tqdm
global_trn_loss = []
global_val_loss = []
for epoch in range(Config.num_epochs):
train_running_loss = []
val_running_loss = []
start_time = time.time()
modelCAN.train()
for idx, data in tqdm(enumerate(trainloader), desc='Training', total=int(len(traindataset)/Config.batch_size)):
bs, ncrops, c, h, w = data[0].size()
bs, v1, v2 = data[1].size()
# Reshape ncrops into batch size
data[0] = data[0].view(-1, c, h, w)
data[1] = data[1].view(-1, v2)
data[0], data[1] = data[0].to(device), data[1].to(device)
optimizer.zero_grad()
output = modelCAN(data[0])
loss = torch.sqrt(criterion(output, data[1]))
loss.backward()
optimizer.step()
train_running_loss.append(loss.item())
else:
modelCAN.eval()
with torch.no_grad():
for idx, data in tqdm(enumerate(valloader), desc='Validation', total=int(len(valdataset)/Config.batch_size)):
bs, ncrops, c, h, w = data[0].size()
bs, v1, v2 = data[1].size()
# Reshape ncrops into batch size
data[0] = data[0].view(-1, c, h, w)
data[1] = data[1].view(-1, v2)
data[0], data[1] = data[0].to(device), data[1].to(device)
output = modelCAN(data[0])
loss = torch.sqrt(criterion(output, data[1]))
val_running_loss.append(loss.item())
global_trn_loss.append(sum(train_running_loss) / len(train_running_loss))
global_val_loss.append(sum(val_running_loss) / len(val_running_loss))
scheduler.step(global_val_loss[-1])
print('Epoch [{}/{}], TRNLoss:{:.4f}, VALLoss:{:.4f}, Time:{:.2f}'.format(
epoch + 1, Config.num_epochs, global_trn_loss[-1], global_val_loss[-1],
(time.time() - start_time) / 60))
if epoch % 20 == 0:
MODEL_SAVE_PATH = f'/gdrive/My Drive/training_models/model_CA_base_bs{Config.batch_size}_epochs{Config.num_epochs}_lr{Config.lr}_sd{Config.subset_data}_pd{Config.patch_dim}_g{Config.gap}.pt'
print(f'Model Saved at {MODEL_SAVE_PATH}')
torch.save(
{
'epoch': Config.num_epochs,
'model_state_dict': modelCAN.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
'global_trnloss': global_trn_loss,
'global_valloss': global_val_loss
}, MODEL_SAVE_PATH)
plt.plot(range(len(global_trn_loss)), global_trn_loss, label='TRN Loss')
plt.plot(range(len(global_val_loss)), global_val_loss, label='VAL Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Chromatic Aberration Training/Validation Loss plot')
plt.legend()
plt.show()
# Unfreezing all pretrained layers for full-network fine-tuning
# Run this if you are loading the pretrained weights of the main model
for param in modelCAN.cnn.parameters():
param.requires_grad = True
for param in modelCAN.fc6.parameters():
param.requires_grad = True
optimizer = optim.SGD(modelCAN.parameters(), lr=0.0001, momentum=0.9)
criterion = nn.MSELoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
mode='min',
patience=5,
factor=0.3, verbose=True)
# global_trn_loss = []
# global_val_loss = []
for epoch in range(Config.num_epochs):
train_running_loss = []
val_running_loss = []
start_time = time.time()
modelCAN.train()
for idx, data in tqdm(enumerate(trainloader), desc='Training', total=int(len(traindataset)/Config.batch_size)):
bs, ncrops, c, h, w = data[0].size()
bs, v1, v2 = data[1].size()
# Reshape ncrops into batch size
data[0] = data[0].view(-1, c, h, w)
data[1] = data[1].view(-1, v2)
data[0], data[1] = data[0].to(device), data[1].to(device)
optimizer.zero_grad()
output = modelCAN(data[0])
loss = torch.sqrt(criterion(output, data[1]))
loss.backward()
optimizer.step()
train_running_loss.append(loss.item())
else:
modelCAN.eval()
with torch.no_grad():
for idx, data in tqdm(enumerate(valloader), desc='Validation', total=int(len(valdataset)/Config.batch_size)):
bs, ncrops, c, h, w = data[0].size()
bs, v1, v2 = data[1].size()
# Reshape ncrops into batch size
data[0] = data[0].view(-1, c, h, w)
data[1] = data[1].view(-1, v2)
data[0], data[1] = data[0].to(device), data[1].to(device)
output = modelCAN(data[0])
loss = torch.sqrt(criterion(output, data[1]))
val_running_loss.append(loss.item())
global_trn_loss.append(sum(train_running_loss) / len(train_running_loss))
global_val_loss.append(sum(val_running_loss) / len(val_running_loss))
scheduler.step(global_val_loss[-1])
print('Epoch [{}/{}], TRNLoss:{:.4f}, VALLoss:{:.4f}, Time:{:.2f}'.format(
epoch + 1, Config.num_epochs, global_trn_loss[-1], global_val_loss[-1],
(time.time() - start_time) / 60))
if epoch % 20 == 0:
MODEL_SAVE_PATH = f'/gdrive/My Drive/model_CA_full_network_bs{Config.batch_size}_epochs{Config.num_epochs}_lr{Config.lr}_sd{Config.subset_data}_pd{Config.patch_dim}_g{Config.gap}.pt'
print(f'Model Saved at {MODEL_SAVE_PATH}')
torch.save(
{
'epoch': Config.num_epochs,
'model_state_dict': modelCAN.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
'global_trnloss': global_trn_loss,
'global_valloss': global_val_loss
}, MODEL_SAVE_PATH)
plt.plot(range(len(global_trn_loss)), global_trn_loss, label='TRN Loss')
plt.plot(range(len(global_val_loss)), global_val_loss, label='VAL Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Fine-Tune Training/Validation Loss plot')
plt.legend()
plt.show()
##########################
# Projection Dataset Class
##########################
class ProjectionDataset(Dataset):
def __init__(self, patch_dim, gap, df, validate, transform=None):
self.patch_dim, self.gap = patch_dim, gap
self.transform = transform
# Both branches previously assigned the same value; the train/val split is determined by the dataframe passed in (df_trn vs df_val)
self.train_data = df.values
def get_patches_and_coordinates(self, image, patch_dim, gap):
patch_loc_arr = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)]
patch_coordinates = []
offset_x, offset_y = image.shape[0] - (patch_dim*3 + gap*2), image.shape[1] - (patch_dim*3 + gap*2)
start_grid_x, start_grid_y = 9, 9
patch_bucket = np.empty([9, 3, 96, 96], dtype='float32')
for i, (tempx, tempy) in enumerate(patch_loc_arr):
patch_x_pt = start_grid_x + patch_dim * (tempx-1) + gap * (tempx-1)
patch_y_pt = start_grid_y + patch_dim * (tempy-1) + gap * (tempy-1)
patch_coordinates.append([patch_x_pt, patch_y_pt])
img_patch = image[patch_x_pt:patch_x_pt+patch_dim, patch_y_pt:patch_y_pt+patch_dim]
# Drop color channels 2 and 3 and replace them with Gaussian noise (std ~1/100 of the std of the remaining channel)
img_patch[:, :, 1] = np.random.normal(0.485, 0.01 * np.std(img_patch[:, :, 0]), (img_patch.shape[0],img_patch.shape[1]))
img_patch[:, :, 2] = np.random.normal(0.485, 0.01 * np.std(img_patch[:, :, 0]), (img_patch.shape[0],img_patch.shape[1]))
# Resizing the patch to 96x96
if img_patch.shape[0] != 96:
img_patch = skimage.transform.resize(img_patch, (96, 96))
img_patch = img_as_float32(img_patch)
patch_bucket[i] = np.transpose(img_patch, (2, 0, 1))
return patch_bucket, np.array(patch_coordinates)
def __len__(self):
return len(self.train_data)
def __getitem__(self, index):
image = np.array(Image.open(self.train_data[index]).convert('RGB'))
patch_bucket, coordinates = self.get_patches_and_coordinates(image, self.patch_dim, self.gap)
coordinates = coordinates.astype(np.float32)
return patch_bucket, coordinates
traindataset = ProjectionDataset(Config.patch_dim, Config.gap, df_trn['filename'], False,
transforms.Compose([transforms.ToTensor(),
])
)
trainloader = torch.utils.data.DataLoader(traindataset,
batch_size=Config.batch_size,
shuffle=True,
)
valdataset = ProjectionDataset(Config.patch_dim, Config.gap, df_val['filename'], True,
transforms.Compose([transforms.ToTensor(),
])
)
valloader = torch.utils.data.DataLoader(valdataset,
batch_size=Config.batch_size,
shuffle=False)
In this model, we dropped color channels 2 and 3 and replaced them with Gaussian noise (std ≈ 1/100 of the std of the remaining channel).
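A standalone NumPy sketch of this channel-dropping step, assuming an HxWx3 float patch: channels 1 and 2 (green and blue) are overwritten with low-variance noise so no cross-channel aberration cue survives, and only the structure of channel 0 remains.

```python
import numpy as np

# Keep channel 0 and replace the other two channels with Gaussian noise
# whose std is ~1/100 of the std of the kept channel.
rng = np.random.default_rng(0)
patch = rng.random((15, 15, 3)).astype('float32')  # stand-in image patch

noise_std = 0.01 * float(np.std(patch[:, :, 0]))
for c in (1, 2):  # the two dropped color channels
    patch[:, :, c] = rng.normal(0.485, noise_std, patch.shape[:2])

# The replaced channels now carry far less variance than the kept one
print(np.std(patch[:, :, 1]) < np.std(patch[:, :, 0]))  # True
```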
class ProjectionNetwork(nn.Module):
def __init__(self):
super(ProjectionNetwork, self).__init__()
self.cnn = nn.Sequential(
nn.Conv2d(3, 96, kernel_size=11, stride=4),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.LocalResponseNorm(96),
nn.Conv2d(96, 384, kernel_size=5, stride=2, padding=2),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.LocalResponseNorm(384),
nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.BatchNorm2d(384),
nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.BatchNorm2d(384),
nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
nn.ReLU(inplace=True),
nn.BatchNorm2d(256),
nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
self.fc6 = nn.Sequential(
nn.Linear(256,4096),
nn.ReLU(inplace=True),
nn.BatchNorm1d(4096),
)
self.fc = nn.Sequential(
nn.Linear(4096, 2)
)
def forward(self, patch_bunch):
bs, _, _, _ = patch_bunch.shape
output = self.cnn(patch_bunch)
output = output.view(bs, -1)
output = self.fc6(output)
output = self.fc(output)
return output
modelPROJ = ProjectionNetwork().to(device)
###############################################
# Load pretrained weights of the previous model
###############################################
m = model.state_dict()
mc = modelPROJ.state_dict()
for idx, k in enumerate(m.keys()):
if idx < 32:
mc[k].copy_(m[k])
else:
print(f'{k} layer weights NOT saved')
# Freezing pretrained layers except the last fc layer
# Run this if you are loading the pretrained weights of the prev model
for param in modelPROJ.cnn.parameters():
param.requires_grad = False
for param in modelPROJ.fc6.parameters():
param.requires_grad = False
optimizer = optim.SGD(modelPROJ.fc.parameters(), lr=0.01, momentum=0.9)
criterion = nn.MSELoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
mode='min',
patience=5,
factor=0.3, verbose=True)
from tqdm.notebook import tqdm
global_trn_loss = []
global_val_loss = []
# previous_val_loss = 100
for epoch in range(Config.num_epochs):
train_running_loss = []
val_running_loss = []
start_time = time.time()
modelPROJ.train()
for idx, data in tqdm(enumerate(trainloader), desc='Training', total=int(len(traindataset)/Config.batch_size)):
bs, ncrops, c, h, w = data[0].size()
bs, v1, v2 = data[1].size()
# Reshape ncrops into batch size
data[0] = data[0].view(-1, c, h, w)
data[1] = data[1].view(-1, v2)
data[0], data[1] = data[0].to(device), data[1].to(device)
optimizer.zero_grad()
output = modelPROJ(data[0])
loss = torch.sqrt(criterion(output, data[1]))
loss.backward()
optimizer.step()
train_running_loss.append(loss.item())
else:
modelPROJ.eval()
with torch.no_grad():
for idx, data in tqdm(enumerate(valloader), desc='Validation', total=int(len(valdataset)/Config.batch_size)):
bs, ncrops, c, h, w = data[0].size()
bs, v1, v2 = data[1].size()
# Reshape ncrops into batch size
data[0] = data[0].view(-1, c, h, w)
data[1] = data[1].view(-1, v2)
data[0], data[1] = data[0].to(device), data[1].to(device)
output = modelPROJ(data[0])
loss = torch.sqrt(criterion(output, data[1]))
val_running_loss.append(loss.item())
global_trn_loss.append(sum(train_running_loss) / len(train_running_loss))
global_val_loss.append(sum(val_running_loss) / len(val_running_loss))
scheduler.step(global_val_loss[-1])
print('Epoch [{}/{}], TRNLoss:{:.4f}, VALLoss:{:.4f}, Time:{:.2f}'.format(
epoch + 1, Config.num_epochs, global_trn_loss[-1], global_val_loss[-1],
(time.time() - start_time) / 60))
if epoch % 20 == 0:
MODEL_SAVE_PATH = f'/gdrive/My Drive/training_models/model_PROJ_base_bs{Config.batch_size}_epochs{Config.num_epochs}_lr{Config.lr}_sd{Config.subset_data}_pd{Config.patch_dim}_g{Config.gap}.pt'
print(f'Model Saved at {MODEL_SAVE_PATH}')
torch.save(
{
'epoch': Config.num_epochs,
'model_state_dict': modelPROJ.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
'global_trnloss': global_trn_loss,
'global_valloss': global_val_loss
}, MODEL_SAVE_PATH)
!ls -la /gdrive/My\ Drive/training_models
plt.plot(range(len(global_trn_loss)), global_trn_loss, label='TRN Loss')
plt.plot(range(len(global_val_loss)), global_val_loss, label='VAL Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Projection Network Training/Validation Loss plot')
plt.legend()
plt.show()
checkpoint = torch.load('/gdrive/My Drive/training_models/model_PROJ_base_bs64_epochs65_lr0.0005_sd950_pd15_g3.pt', map_location=device)
modelPROJ.load_state_dict(checkpoint['model_state_dict'])
# Unfreezing all pretrained layers for full-network fine-tuning
# Run this if you are loading the pretrained weights of the prev model
for param in modelPROJ.cnn.parameters():
param.requires_grad = True
for param in modelPROJ.fc6.parameters():
param.requires_grad = True
optimizer = optim.SGD(modelPROJ.parameters(), lr=0.0001, momentum=0.9)
criterion = nn.MSELoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
mode='min',
patience=5,
factor=0.3, verbose=True)
global_trn_loss = checkpoint['global_trnloss']
global_val_loss = checkpoint['global_valloss']
from tqdm.notebook import tqdm
# global_trn_loss = []
# global_val_loss = []
# previous_val_loss = 100
for epoch in range(Config.num_epochs):
train_running_loss = []
val_running_loss = []
start_time = time.time()
modelPROJ.train()
for idx, data in tqdm(enumerate(trainloader), desc='Training', total=int(len(traindataset)/Config.batch_size)):
bs, ncrops, c, h, w = data[0].size()
bs, v1, v2 = data[1].size()
# Reshape ncrops into batch size
data[0] = data[0].view(-1, c, h, w)
data[1] = data[1].view(-1, v2)
data[0], data[1] = data[0].to(device), data[1].to(device)
optimizer.zero_grad()
output = modelPROJ(data[0])
loss = torch.sqrt(criterion(output, data[1]))
loss.backward()
optimizer.step()
train_running_loss.append(loss.item())
else:
modelPROJ.eval()
with torch.no_grad():
for idx, data in tqdm(enumerate(valloader), desc='Validation', total=int(len(valdataset)/Config.batch_size)):
bs, ncrops, c, h, w = data[0].size()
bs, v1, v2 = data[1].size()
# Reshape ncrops into batch size
data[0] = data[0].view(-1, c, h, w)
data[1] = data[1].view(-1, v2)
data[0], data[1] = data[0].to(device), data[1].to(device)
output = modelPROJ(data[0])
loss = torch.sqrt(criterion(output, data[1]))
val_running_loss.append(loss.item())
global_trn_loss.append(sum(train_running_loss) / len(train_running_loss))
global_val_loss.append(sum(val_running_loss) / len(val_running_loss))
scheduler.step(global_val_loss[-1])
print('Epoch [{}/{}], TRNLoss:{:.4f}, VALLoss:{:.4f}, Time:{:.2f}'.format(
epoch + 1, Config.num_epochs, global_trn_loss[-1], global_val_loss[-1],
(time.time() - start_time) / 60))
if epoch % 20 == 0:
MODEL_SAVE_PATH = f'/gdrive/My Drive/training_models/model_PROJ_full_network_bs{Config.batch_size}_epochs{Config.num_epochs}_lr{Config.lr}_sd{Config.subset_data}_pd{Config.patch_dim}_g{Config.gap}.pt'
print(f'Model Saved at {MODEL_SAVE_PATH}')
torch.save(
{
'epoch': Config.num_epochs,
'model_state_dict': modelPROJ.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
'global_trnloss': global_trn_loss,
'global_valloss': global_val_loss
}, MODEL_SAVE_PATH)
plt.plot(range(len(global_trn_loss)), global_trn_loss, label='TRN Loss')
plt.plot(range(len(global_val_loss)), global_val_loss, label='VAL Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Projection Network Training/Validation Loss plot')
plt.legend()
plt.show()
model = AlexNetwork().to(device)
random_net = AlexNetwork().to(device)
checkpoint = torch.load('/gdrive/My Drive/training_models/model_colab_300.pt', map_location='cuda')
model.load_state_dict(checkpoint['model_state_dict'])
alexnet = models.alexnet(pretrained=True)
new_classifier = nn.Sequential(*list(alexnet.classifier.children())[:-1])
alexnet.classifier = new_classifier
alexnet.to(device)
# Create 2 loaders
# One is used for calculating vectors
# Second one is used for comparing the vectors
std = np.array([np.ones((96,96))*0.229, np.ones((96,96))*0.224, np.ones((96,96))*0.225])
mean = np.array([np.ones((96,96))*0.485, np.ones((96,96))*0.456, np.ones((96,96))*0.406])
dataloader_1 = DataLoader(traindataset,
shuffle=True,
batch_size=64)
dataloader_2 = DataLoader(traindataset,
shuffle=True,
batch_size=64)
data_iter_1 = iter(dataloader_1)
data_iter_2 = iter(dataloader_2)
example_batch = next(data_iter_1)
# lists for storing vectors of pre-trained, trained and random model
vectors_pretrained = []
vectors_trained = []
vectors_random = []
for j, data in enumerate(dataloader_2,0):
img0, img1 , label = data
#print(label)
label = label.reshape([-1])
img0, img1 , label = img0.float().to(device), img1.float().to(device) , label.long().to(device)
# Output3 contains vector for trained model
output ,output1,output3= model(img0,img1)
# Output2 contains vector for pre-trained model
output2 = alexnet(img1.float())
# Output4 contains vector for a random initialization
output ,output1,output4= random_net(img0,img1)
img1 = img1.cpu().detach().numpy()
#All vectors are stored in numpy format
output2 = output2.cpu().detach().numpy()
output3 = output3.cpu().detach().numpy()
output4 = output4.cpu().detach().numpy()
for i in range(len(output2)):
vectors_pretrained.append([img1[i],output2[i]])
for i in range(len(output3)):
vectors_trained.append([img1[i],output3[i]])
for i in range(len(output4)):
vectors_random.append([img1[i],output4[i]])
#example.shape
img0 , img1 , label = example_batch
label = label.reshape([-1])
img0, img1 , label = img0.float().to(device), img1.float().to(device) , label.long().to(device)
#Calculate the reference vector for each model similar to the above method.
output2 = alexnet(img1)
output_pretrained = output2.cpu().detach().numpy()
output ,output1,output3= model(img0,img1)
output ,output1,output4= random_net(img0,img1)
img1 = img1.cpu().detach().numpy()
output_trained = output3.cpu().detach().numpy()
output_random = output4.cpu().detach().numpy()
for i in range(20):
#Sort the vectors according to the euclidean distance between the vectors
vectors_pretrained.sort(key=lambda tup: np.linalg.norm(tup[1]-output_pretrained[i]))
vectors_trained.sort(key=lambda tup: np.linalg.norm(tup[1]-output_trained[i]))
vectors_random.sort(key=lambda tup: np.linalg.norm(tup[1]-output_random[i]))
npimg = img1[i]
fig = plt.figure(figsize=(20.,20.))
ax1 = fig.add_subplot(1,13,1)
#Plot each image after denormalizing.
plt.axis("off")
plt.title("original")
plt.grid()
ax1.imshow(np.transpose(np.clip(np.multiply(npimg,std)+mean,0,1), (1, 2, 0)))
# This loop plots the similar images according to the random initialization
for j in range(1,5):
ax1 = fig.add_subplot(1,13,j+1)
ax1.imshow(np.transpose(np.clip(np.multiply(vectors_random[j-1][0],std)+mean,0,1), (1, 2, 0)))
plt.axis("off")
plt.title("random")
# This loop plots the similar images according to the pretrained model
for j in range(1,5):
ax1 = fig.add_subplot(1,13,j+5)
ax1.imshow(np.transpose(np.clip(np.multiply(vectors_pretrained[j-1][0],std)+mean,0,1), (1, 2, 0)))
plt.axis("off")
plt.title("alexnet")
# This loop plots the similar images according to our trained model
for j in range(1,5):
ax1 = fig.add_subplot(1,13,j+9)
ax1.imshow(np.transpose(np.clip(np.multiply(vectors_trained[j-1][0],std)+mean,0,1), (1, 2, 0)))
# ax1.imshow(np.transpose(unorm(torch.tensor(vectors[j-1][0])), (1, 2, 0)))
plt.axis("off")
plt.title("ours")
plt.show()
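The retrieval above sorts the stored (image, vector) pairs in place by Euclidean distance to a query embedding, then takes the first few entries as nearest neighbours. A minimal sketch of that pattern on toy 2-D vectors:

```python
import numpy as np

# Sort a bank of (id, vector) pairs by Euclidean distance to a query,
# mirroring the `vectors_*.sort(key=...)` calls above.
bank = [('a', np.array([0.0, 0.0])),
        ('b', np.array([1.0, 1.0])),
        ('c', np.array([5.0, 5.0]))]
query = np.array([0.9, 1.1])

bank.sort(key=lambda item: np.linalg.norm(item[1] - query))
print([name for name, _ in bank])  # ['b', 'a', 'c'] -> nearest first
```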
We have plotted the most similar images according to the different models. The original image is the query; "random" denotes the nearest neighbours under a randomly initialized network, and likewise we plot neighbours for the pre-trained AlexNet and for our trained model. Even the randomly initialized model fares reasonably well, because its output vector is still a non-linear function of the input image. AlexNet was trained on the entire ImageNet dataset, so the features obtained from it are very strong; our model, by contrast, was trained on only 10% of Tiny ImageNet (a small, low-resolution version of ImageNet), a restriction imposed by our computational resources. The main takeaway is that we can obtain AlexNet-level features by training a model with a smaller number of trainable parameters on far less data. Another interesting observation is that random initialization also gives comparable results.
def get_patches_and_coordinates(image, patch_dim, gap):
patch_loc_arr = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)]
patch_coordinates = []
offset_x, offset_y = image.shape[0] - (patch_dim*3 + gap*2), image.shape[1] - (patch_dim*3 + gap*2)
start_grid_x, start_grid_y = 9, 9
patch_bucket = np.empty([9, 3, 96, 96], dtype='float32')
for i, (tempx, tempy) in enumerate(patch_loc_arr):
patch_x_pt = start_grid_x + patch_dim * (tempx-1) + gap * (tempx-1)
patch_y_pt = start_grid_y + patch_dim * (tempy-1) + gap * (tempy-1)
patch_coordinates.append([patch_x_pt, patch_y_pt])
img_patch = image[patch_x_pt:patch_x_pt+patch_dim, patch_y_pt:patch_y_pt+patch_dim]
# Resizing the patch to 96x96
if img_patch.shape[0] != 96:
img_patch = skimage.transform.resize(img_patch, (96, 96))
img_patch = img_as_float32(img_patch)
patch_bucket[i] = np.transpose(img_patch, (2, 0, 1))
return patch_bucket, np.array(patch_coordinates)
def inference_patch(image_name):
image = np.array(Image.open(image_name))
patch_bucket, coordinates = get_patches_and_coordinates(image, Config.patch_dim, Config.gap)
return patch_bucket, coordinates
p_b, coor = inference_patch(df_val['filename'][5])
!ls /gdrive/My\ Drive/training_models
# Loading Chromatic Aberration Network weights
checkpoint = torch.load('/gdrive/My Drive/training_models/model_CA_full_network_bs64_epochs65_lr0.0005_sd1000_pd15_g3.pt', map_location=device)
modelCAN.load_state_dict(checkpoint['model_state_dict'])
coor_can = modelCAN(torch.from_numpy(p_b).to(device))
print(f'x, y coordinates of the patches {coor_can}')
# Loading Projection Network weights
checkpoint = torch.load('/gdrive/My Drive/training_models/model_PROJ_full_network_bs64_epochs65_lr0.0005_sd950_pd15_g3.pt', map_location=device)
modelPROJ.load_state_dict(checkpoint['model_state_dict'])
coor_proj = modelPROJ(torch.from_numpy(p_b).to(device))
print(f'x, y coordinates of the patches {coor_proj}')
display_canvas(convert_format(p_b, format='n'), coor, title='Original Image patch locations')
display_canvas(convert_format(p_b, format='n'), coor_can.cpu().detach().numpy(), title='Predicted Chromatic Aberration network patch locations')
display_canvas(convert_format(p_b, format='n'), coor_proj.cpu().detach().numpy(), title='Predicted Projection Network patch locations')
The first image canvas contains the patches of the original image, each aligned to its original (x, y) position.
The second canvas plots the patches at the coordinates predicted by the Chromatic Aberration Network (CAN). In this method, all color channels are retained during training.
The third canvas plots the patches at the coordinates predicted by the Projection Network (PN). Here we dropped two color channels and replaced them with Gaussian noise (std ≈ 1/100 of the std of the remaining channel).
Comparing the outputs of the CAN and the PN, it is clear that chromatic aberration interferes with training.
In the second canvas, the color channels are retained, so the model learned the chromatic aberration rather than the semantics of the image, and it predicted the patch (x, y) coordinates almost exactly.
In the Projection Network we removed the color channels, so the CNN could no longer rely on this shortcut; the predictions degraded sharply, collapsing to nearly the same (x, y) coordinate for every patch.
Because of this chromatic anomaly, the CNN does not learn the semantics of the image during patch-wise training; instead it learns the chromatic aberration, which is easier for a CNN to pick up than the semantics.
While solving a real-world problem, if we have ample labeled data available, the easiest way to solve it is a supervised learning approach. Looking back over the past decade or so, supervised approaches have shown promising results, but in practice labeled data is often unavailable, or its cost is too high. To tackle this, we can resort to another approach, unsupervised learning, which aims to use unlabeled data to perform the task. Even so, unsupervised methods have not been up to the mark in terms of performance; after all, without labels it is not even clear what to represent. If we design the model in such a way that capturing a good representation is encouraged, the model will become better.
Moreover, whenever possible, we should initialize the network from a pre-trained model and then fine-tune it. When this works and we don't have to train from scratch, it helps us in many ways: faster convergence, better results, and less training data. The problem with this approach is that we don't have pre-trained models for every domain. We can either reuse the first few layers of a model trained on some other domain, or we can use the data in such a way that the labels become part of the dataset itself. This is where the "pretext task" plays an important role. **In our work, we provide a self-supervised formulation by predicting the context of a patch as the "pretext task".** We sample a random pair out of the 9 patches of an image and try to predict the position of one patch relative to the other. This pretext task helps our model learn features that can then be used to solve other real-world problems.
The first task is handling the dataset: transforming it into a form the model can operate on directly. We manipulated the dataset so that the model cannot fall back on trivial solutions but must learn the true feature space. We then presented a ConvNet-based architecture for pair classification and trained our model on a mini-ImageNet dataset; this network learns to predict the position of image patches relative to one another. To check whether the network learns accurately, we evaluated it with a nearest-neighbour approach. Due to chromatic aberration, our network was initially unable to retrieve true nearest neighbours. We tackled the problem by training the network to predict the absolute coordinates of patches sampled from the dataset.
The experimental results showed that the nearest-neighbour approach produced visually similar retrievals once the network learned the semantics properly. Earlier it had learned the chromatic aberration instead of the true semantics of the image; we avoided this by dropping color channels 2 and 3 and replacing them with Gaussian noise (std ≈ 1/100 of the std of the remaining channel).
[1] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[2] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool. Domain adaptive faster R-CNN for object detection in the wild. In Conference on Computer Vision and Pattern Recognition(CVPR), 2018.
[3] D. Dai and L. Van Gool. Dark model adaptation: Semantic image segmentation from daytime to nighttime. arXiv preprint arXiv:1810.02575, 2018.
[4] R.K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 2005.
[5] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.
[6] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[7] D. Okanohara and J. Tsujii. A discriminative language model with pseudo negative samples. In ACL, 2007.
[8] F. Ebert, S. Dasari, A.X. Lee, S. Levine, and C. Finn. Robustness via retrying: Closed-loop robotic manipulation with self-supervised learning. Conference on Robot Learning (CoRL), 2018.
[9] E. Jang, C. Devin, V. Vanhoucke, and S. Levine. Grasp2Vec: Learning object representations from self-supervised grasping. In Conference on Robot Learning, 2018.
[10] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg. Making sense of vision and touch: Self-supervised learning of multi modal representations for contact rich tasks. arXiv preprint arXiv:1810.10191, 2018.
[11] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017.
[12] A. Owens and A.A. Efros. Audio-visual scene analysis with self-supervised multisensory features. European Conference on Computer Vision (ECCV), 2018.
[13] N. Sayed, B. Brattoli, and B. Ommer. Cross and learn: Cross modal self-supervision. arXiv preprint arXiv:1811.03879, 2018.
[14] B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from self-supervised synchronization. arXiv preprint arXiv:1807.00230, 2018.
[15] G. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 2006.
[16] R. Salakhutdinov and G.E. Hinton. Deep boltzmann machines. In ICAIS, 2009.
[17] G.E. Hinton, P. Dayan, B.J. Frey, and R.M. Neal. The “wake sleep” algorithm for unsupervised neural networks. Proceedings. IEEE, 1995.
[18] D.P. Kingma and M. Welling. Auto encoding variational bayes. 2014.
[19] D.J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
[20] Y. Bengio, E. Thibodeau Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. ICML, 2014.
[21] P. Vincent, H. Larochelle, Y. Bengio, and P.A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[22] Q.V. Le. Building high-level features using large scale unsupervised learning. In ICASSP, 2013.
[23] H. Lee, A. Battle, R. Raina, and A.Y. Ng. Efficient sparse coding algorithms. In NIPS, 2006.
[24] C. Doersch, A. Gupta, and A.A. Efros. Context as supervisory signal: Discovering objects with predictable context. In ECCV. 2014.
[25] J. Domke, A. Karapurkar, and Y. Aloimonos. Who killed the directed model? In CVPR, 2008.
[26] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In AISTATS, 2011.
[27] L. Theis and M. Bethge. Generative image modeling using spatial lstms. In NIPS, 2015.
[28] T. Malisiewicz and A. Efros. Beyond categories: The visual memex model for reasoning about object relationships. In NIPS, 2009.